NSF PAR Search | NSF Public Access Repository

Bidirectional subsethood of shared marker profiles enables accurate virus classification

https://doi.org/10.1186/s40168-025-02159-x

Riccardi, Christopher; Wang, Yuqiu; Yooseph, Shibu; Sun, Fengzhu (July 2025, Microbiome)
Model Selection for Sparse Microbial Network Inference using Variational Approximation

Yooseph, Shibu (March 2025, Springer Verlag)

Microbial communities are often composed of taxa from different taxonomic groups. The associations among the constituent members in a microbial community play an important role in determining the functional characteristics of the community, and these associations can be modeled using an edge-weighted graph (microbial network). A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa abundance in each sample. Motivated by microbiome studies that involve a large number of samples collected across a range of study parameters, here we consider the computational problem of identifying the number of microbial networks underlying the observed sample-taxa abundance matrix. Specifically, we consider the problem of determining the number of sparse microbial networks in this setting. We use a mixture model framework to address this problem, and present formulations to model both count data and proportion data. We propose several variational approximation-based algorithms that allow the incorporation of the sparsity constraint while estimating the number of components in the mixture model. We evaluate these algorithms on a large number of simulated datasets generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).
Free, publicly-accessible full text available March 31, 2026
Variational Approximation-Based Model Selection for Microbial Network Inference

https://doi.org/10.1089/cmb.2021.0595

Yooseph, Shibu; Tavakoli, Sahar (July 2022, Journal of Computational Biology)

Full Text Available
Integrated de novo gene prediction and peptide assembly of metagenomic sequencing data

https://doi.org/10.1093/nargab/lqad023

Thippabhotla, Sirisha; Liu, Ben; Podgorny, Adam; Yooseph, Shibu; Yang, Youngik; Zhang, Jun; Zhong, Cuncong (March 2023, NAR Genomics and Bioinformatics)

Metagenomics is the study of all genomic content contained in given microbial communities. Metagenomic functional analysis aims to quantify protein families and reconstruct metabolic pathways from the metagenome. It plays a central role in understanding the interaction between the microbial community and its host or environment. De novo functional analysis, which allows the discovery of novel protein families, remains challenging for high-complexity communities. There are currently three main approaches for recovering novel genes or proteins: de novo nucleotide assembly, gene calling and peptide assembly. Unfortunately, their information dependency has been overlooked, and each has been formulated as an independent problem. In this work, we develop a sophisticated workflow called integrated Metagenomic Protein Predictor (iMPP), which leverages the information dependencies for better de novo functional analysis. iMPP contains three novel modules: a hybrid assembly graph generation module, a graph-based gene calling module, and a peptide assembly-based refinement module. iMPP significantly improved the existing gene calling sensitivity on unassembled metagenomic reads, achieving a 92–97% recall rate at a high precision level (>85%). iMPP further allowed for more sensitive and accurate peptide assembly, recovering more reference proteins and delivering more hypothetical protein sequences. The high performance of iMPP can provide a more comprehensive and unbiased view of the microbial communities under investigation. iMPP is freely available from https://github.com/Sirisha-t/iMPP.
Variational Approximation based Model Selection for Microbial Network Inference

Yooseph, Shibu; Tavakoli, Sahar (January 2022, Journal of Computational Biology)
Singh, Mona (Ed.)
Microbial associations are characterized by both direct and indirect interactions between the constituent taxa in a microbial community, and play an important role in determining the structure, organization, and function of the community. Microbial associations can be represented using a weighted graph (microbial network) whose nodes represent taxa and edges represent pairwise associations. A microbial network is typically inferred from a sample-taxa matrix that is obtained by sequencing multiple biological samples and identifying the taxa counts in each sample. However, it is known that microbial associations are impacted by environmental and/or host factors. Thus, a sample-taxa matrix generated in a microbiome study involving a wide range of values for the environmental and/or clinical metadata variables may in fact be associated with more than one microbial network. Here we consider the problem of inferring multiple microbial networks from a given sample-taxa count matrix. Each sample is a count vector assumed to be generated by a mixture model consisting of component distributions that are Multivariate Poisson Log-Normal. We present a variational Expectation Maximization algorithm for the model selection problem to infer the correct number of components of this mixture model. Our approach involves reframing the mixture model as a latent variable model, treating only the mixing coefficients as parameters, and subsequently approximating the marginal likelihood using an evidence lower bound framework. Our algorithm is evaluated on a large simulated dataset generated using a collection of different graph structures (band, hub, cluster, random, and scale-free).
Full Text Available
De Novo Genome Assembly Highlights the Role of Lineage-Specific Gene Duplications in the Evolution of Venom in Fea's Viper (Azemiops feae)

https://doi.org/10.1093/gbe/evac082

Myers, Edward A; Strickland, Jason L; Rautsaw, Rhett M; Mason, Andrew J; Schramer, Tristan D; Nystrom, Gunnar S; Hogan, Michael P; Yooseph, Shibu; Rokyta, Darin R; Parkinson, Christopher L (July 2022, Genome Biology and Evolution)
Qian, Wenfeng (Ed.)
Despite the medical significance to humans and important ecological roles filled by vipers, few high-quality genomic resources exist for these snakes outside of a few genera of pitvipers. Here we sequence, assemble, and annotate the genome of Fea’s Viper (Azemiops feae). This taxon is distributed in East Asia and belongs to a monotypic subfamily, sister to the pitvipers. The newly sequenced genome resulted in a 1.56 Gb assembly, a contig N50 of 1.59 Mb, with 97.6% of the genome assembly in contigs >50 Kb, and a BUSCO completeness of 92.4%. We found that A. feae venom is primarily composed of phospholipase A2 (PLA2) proteins expressed by genes that likely arose from lineage-specific PLA2 gene duplications. Additionally, we show that renin, an enzyme associated with blood pressure regulation in mammals and known from the venoms of two viper species including A. feae, is expressed in the venom gland at levels comparable to known toxins and is present in the venom proteome. The cooption of this gene as a toxin may be more widespread in viperids than currently known. To investigate the historical population demographics of A. feae, we performed coalescent-based analyses and determined that the effective population size has remained stable over the last 100 kyr. This suggests Quaternary glacial cycles likely had minimal influence on the demographic history of A. feae. This newly assembled genome will be an important resource for studying the genomic basis of phenotypic evolution and understanding the diversification of venom toxin gene families.
Full Text Available
GRASP2: fast and memory-efficient gene-centric assembly and homolog search for metagenomic sequencing data

https://doi.org/10.1186/s12859-019-2818-1

Zhong, Cuncong; Yang, Youngik; Yooseph, Shibu (June 2019, BMC Bioinformatics)

Full Text Available
Global ecotypes in the ubiquitous marine clade SAR86

https://doi.org/10.1038/s41396-019-0516-7

Hoarfrost, Adrienne; Nayfach, Stephen; Ladau, Joshua; Yooseph, Shibu; Arnosti, Carol; Dupont, Chris L.; Pollard, Katherine S. (October 2019, The ISME Journal)

SAR86 is an abundant and ubiquitous heterotroph in the surface ocean that plays a central role in the function of marine ecosystems. We hypothesized that despite its ubiquity, different SAR86 subgroups may be endemic to specific ocean regions and functionally specialized for unique marine environments. However, the global biogeographical distributions of SAR86 genes, and the manner in which these distributions correlate with marine environments, have not been investigated. We quantified SAR86 gene content across globally distributed metagenomic samples and modeled these gene distributions as a function of 51 environmental variables. We identified five distinct clusters of genes within the SAR86 pangenome, each with a unique geographic distribution associated with specific environmental characteristics. Gene clusters are characterized by the strong taxonomic enrichment of distinct SAR86 genomes and partial assemblies, as well as differential enrichment of certain functional groups, suggesting differing functional and ecological roles of SAR86 ecotypes. We then leveraged our models and high-resolution, remote sensing-derived environmental data to predict the distributions of SAR86 gene clusters across the world’s oceans, creating global maps of SAR86 ecotype distributions. Our results reveal that SAR86 exhibits previously unknown, complex biogeography, and provide a framework for exploring geographic distributions of genetic diversity from other microbial clades.
Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea

https://doi.org/10.1038/nbt.3893

Bowers, Robert M; Kyrpides, Nikos C; Stepanauskas, Ramunas; Harmon-Smith, Miranda; Doud, Devin; Reddy, T B; Schulz, Frederik; Jarett, Jessica; Rivers, Adam R; Eloe-Fadrosh, Emiley A; et al. (August 2017, Nature Biotechnology)

Full Text Available
